03 Kiva Loans: Analysis in Two Variables

In this file, we begin to consider relationships between variables. We have two basic types of questions that we’re interested in:

  1. What are the relationships between loan initialization variables? Questions of this type will lead us toward understanding how loans vary from country to country, demographic to demographic, how descriptions relate to loan amounts, and so forth.
  2. What are the inter-relationships between loan initialization variables and variables down the pipeline? Questions of this type will lead us toward understanding how the features of the loan – amount, country, demographics – relate to how that loan is funded, disbursed, and repaid.

Given the negative values in the disbursal_timediff variable, however, we discussed cutting the supply chain in two and examining the “posting-to-funding” component separately from the “disbursal-to-repayment” component.

We also generated some data partitioning ideas at the end of our single-variable EDA, namely:

Let’s load the data and get started.

library(jsonlite)
library(ggplot2)
library(psych)
library(RColorBrewer)
library(countrycode)

loans = jsonlite::fromJSON("newpaid.json")

loans$posted_datetime = strptime(loans$posted_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$funded_datetime = strptime(loans$funded_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$disbursal_datetime = strptime(loans$disbursal_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$paid_datetime = strptime(loans$paid_datetime, format = "%Y-%m-%d %H:%M:%S")

We know that we have no missing data, so we’ll dive right in.

Relationships among loan features

First, we’ll generate a scatterplot matrix for the numeric loan feature variables:

pairs.panels(loans[, c(8, 9, 17, 18)])

Most pairs of variables are not correlated. There is no relationship between the length of a loan’s description and its amount, the number of borrowers, or the gender composition of its borrowers, nor is the gender composition of the borrowers related to the amount or number of borrowers. (The apparent structure in the number-female plot is merely an artifact of integer denominators.) The only correlation here is between the loan amount and the number of borrowers. Although many loans, regardless of amount, have a single borrower – perhaps one representative for a larger group – the general trend is that a larger loan means more borrowers.

We see a few high-leverage points, where the loan amount is greater than 5000 USD, but these do not seem to be throwing off the overall trends.

Next, we want to consider categorical features of the loan, country and sector, in relation to our other loan variables. Let’s look at sector and loan amount first:

ggplot(data=loans, aes(x=sector, y=terms.loan_amount, colour=sector)) + 
  geom_boxplot() + scale_colour_discrete(guide=FALSE) +
  xlab("Sector") + ylab("Loan amount") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

For the bulk of the loans, there isn’t much difference between sectors: the mass is concentrated below about 1250 USD. Sector distinctions seem to matter more when we look at the tails. The education, food, manufacturing, and service sectors all have multiple loans that extend well beyond 1250 USD. We can examine this segment of loans values by itself:

large_loans = loans[loans$terms.loan_amount >= 1250, ]
ggplot(data=large_loans, aes(x=sector, y=terms.loan_amount, colour=sector)) + 
  geom_boxplot() + scale_colour_discrete(guide=FALSE) +
  xlab("Sector") + ylab("Loan amount") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Even for this segment, though, the median and IQRs by sector are comparable. Perhaps segmenting by loan amount won’t prove interesting after all.

How does the sector relate to the number of borrowers? Loans with one borrower will dominate the plot, so let’s look only at loans with more than one borrowers:

ggplot(data=loans[loans$b.num>1, ], aes(x=sector, y=b.num, colour=sector)) + 
  geom_boxplot() + scale_colour_discrete(guide=FALSE) +
  xlab("Sector") + ylab("Number of borrowers") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Again, the ranges are quite comparable. Entertainment sector is the only real stand-out.

Is there a relationship between gender and sector?

ggplot(data=loans, aes(x=sector, y=b.female, colour=sector)) + 
  geom_boxplot() + scale_colour_discrete(guide=FALSE) +
  xlab("Sector") + ylab("Fraction of females in borrowing group") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Yes, there are certain sectors where the borrowers are almost exclusive female: arts, clothing, food, and retail. This is the exception, however. All in all, sector doesn’t seem to reveal much when it comes to the characteristics of a loan. We can quantify its effect on the loan amount:

summary.lm(aov(terms.loan_amount ~ sector, loans))
## 
## Call:
## aov(formula = terms.loan_amount ~ sector, data = loans)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -862.9 -430.4 -192.6  187.1 9282.4 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          716.6243    13.7199  52.232  < 2e-16 ***
## sectorArts            95.4826    42.6293   2.240  0.02512 *  
## sectorClothing       152.0449    27.3214   5.565 2.67e-08 ***
## sectorConstruction    98.5481    44.4358   2.218  0.02659 *  
## sectorEducation      163.6535    77.0980   2.123  0.03380 *  
## sectorEntertainment  133.3757   180.4577   0.739  0.45986    
## sectorFood             0.9781    18.1311   0.054  0.95698    
## sectorHealth         182.1045    67.6632   2.691  0.00713 ** 
## sectorHousing         -8.4243    39.6187  -0.213  0.83162    
## sectorManufacturing   98.8635    52.1076   1.897  0.05781 .  
## sectorPersonal Use    46.9174    74.7286   0.628  0.53012    
## sectorRetail          17.2302    18.6083   0.926  0.35449    
## sectorServices       171.2642    25.7246   6.658 2.89e-11 ***
## sectorTransportation -36.1917    35.0792  -1.032  0.30222    
## sectorWholesale       31.9051   124.1948   0.257  0.79726    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 719.7 on 13763 degrees of freedom
## Multiple R-squared:  0.007283,   Adjusted R-squared:  0.006273 
## F-statistic: 7.212 on 14 and 13763 DF,  p-value: 3.571e-15

The very low adjusted R-squared means a negligible effect size.

What about country? Does it help explain the characteristics of a loan? In the plot below, countries are colored by continent.

# helper function
get_continent = function(df){
  countrycode(df$location.country_code, origin="iso2c", destination="continent")
}
ggplot(data=loans, aes(x=location.country_code, y=terms.loan_amount, 
                       fill=get_continent(loans))) + geom_boxplot() + 
  xlab("Country") + ylab("Loan amount") +
  scale_fill_discrete(name="Continent") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Here, we see a significant variance in loan amount, within and across countries. Some countries have a median loan amount of over 3000 USD (the U.S., notably), even as many have a median of under 1000 USD. Even for these latter countries, however, there can be a large number of loans in the upper quartile. (These are the dots lying outside the boxplots.) African countries and countries from the Americas tend to have a lower median loan amount than do Asian countries.

Let’s remove countries that don’t show up much in our sample before analyzing how country relates to other loans variables.

high_countries = table(loans$location.country_code) >= 30
hc_vec = high_countries[loans$location.country_code] == TRUE
hc_loans = loans[hc_vec, ] 

An ANOVA shows that country does a better job of explaining the variance in loan amount:

summary.lm(aov(terms.loan_amount ~ location.country_code, hc_loans))
## 
## Call:
## aov(formula = terms.loan_amount ~ location.country_code, data = hc_loans)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1651.1  -269.8   -97.5   151.5  4348.2 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               704.78      70.69   9.970  < 2e-16 ***
## location.country_codeAM  1246.28     110.58  11.271  < 2e-16 ***
## location.country_codeAZ   531.89      79.50   6.691 2.31e-11 ***
## location.country_codeBJ  -213.33      89.32  -2.388 0.016936 *  
## location.country_codeBO   563.67      76.85   7.335 2.35e-13 ***
## location.country_codeCM  -153.35     121.27  -1.265 0.206055    
## location.country_codeCO   -13.18      88.62  -0.149 0.881743    
## location.country_codeCR   231.38     105.19   2.200 0.027852 *  
## location.country_codeDO   870.02      87.72   9.919  < 2e-16 ***
## location.country_codeEC   115.05      76.80   1.498 0.134164    
## location.country_codeGH  -135.53      76.17  -1.779 0.075237 .  
## location.country_codeGT   288.79      99.26   2.910 0.003625 ** 
## location.country_codeHN  -123.10      89.03  -1.383 0.166801    
## location.country_codeID  -187.85      92.40  -2.033 0.042060 *  
## location.country_codeIQ  1545.22     118.07  13.088  < 2e-16 ***
## location.country_codeKE  -231.28      73.51  -3.146 0.001659 ** 
## location.country_codeKG   626.35      90.57   6.916 4.87e-12 ***
## location.country_codeKH    64.92      72.92   0.890 0.373343    
## location.country_codeLB   549.71      81.01   6.785 1.21e-11 ***
## location.country_codeLR  -260.60      98.25  -2.653 0.007998 ** 
## location.country_codeML   132.63      86.07   1.541 0.123327    
## location.country_codeMN   591.03      81.99   7.209 5.95e-13 ***
## location.country_codeMX   207.07      76.45   2.708 0.006767 ** 
## location.country_codeMZ  -190.80     103.72  -1.840 0.065846 .  
## location.country_codeNG   -95.35      77.40  -1.232 0.218010    
## location.country_codeNI  -157.57      73.74  -2.137 0.032639 *  
## location.country_codeNP  -468.35     121.27  -3.862 0.000113 ***
## location.country_codePE   -77.97      72.11  -1.081 0.279603    
## location.country_codePH  -403.84      71.89  -5.618 1.97e-08 ***
## location.country_codePK   182.14      82.10   2.219 0.026528 *  
## location.country_codePS  1165.03     106.81  10.907  < 2e-16 ***
## location.country_codePY   668.48      81.06   8.247  < 2e-16 ***
## location.country_codeQS  -332.23      84.96  -3.910 9.26e-05 ***
## location.country_codeRW   177.18      84.55   2.096 0.036127 *  
## location.country_codeSL   126.58      84.00   1.507 0.131866    
## location.country_codeSN   338.81      82.32   4.116 3.88e-05 ***
## location.country_codeSV   -46.79      80.05  -0.585 0.558850    
## location.country_codeTG    95.73      79.94   1.198 0.231093    
## location.country_codeTJ   236.75      75.35   3.142 0.001682 ** 
## location.country_codeTZ    56.71      78.67   0.721 0.471017    
## location.country_codeUA   541.20      94.36   5.736 9.92e-09 ***
## location.country_codeUG   413.66      75.53   5.477 4.42e-08 ***
## location.country_codeVN    80.09      82.10   0.976 0.329289    
## location.country_codeWS   -81.76      84.08  -0.972 0.330854    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 582.9 on 13501 degrees of freedom
## Multiple R-squared:  0.2405, Adjusted R-squared:  0.2381 
## F-statistic: 99.44 on 43 and 13501 DF,  p-value: < 2.2e-16

The adjusted R-squared is 0.24, much better than the same statistic for sector. (If we include all countries, it climbs to 0.32.)

How does country relate to number of borrowers?

ggplot(data=hc_loans, aes(x=location.country_code, y=b.num, 
                       fill=get_continent(hc_loans))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Number of borrowers") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, loans with one borrower dominate, yet there are some countries where the number of borrowers tends to range more widely, such as Paraguay (PY).

Does country relate to the gender composition of the borrowers?

ggplot(data=hc_loans, aes(x=location.country_code, y=b.female, 
                       fill=get_continent(hc_loans))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Fraction of females in borrower group") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

In many countries, the loans are dominated by female borrower groups, but in many others, there are male groups as well. The most interesting feature of the above plot is probably the countries where there are a large number of mixed-gender borrower groups: Cambodia (KH), Peru (PE), The Philippines (PH), and Tanzania (TZ).

Finally, we can look at how our two categorical variables, country and sector, relate to one another. In our univariate EDA, we saw that agriculture, food, and retail were the main sector represented in the loan data. Is this true for each country?

getPalette = colorRampPalette(brewer.pal(11, "Spectral"))

num_sectors = length(unique(hc_loans$sector))

ggplot(data=hc_loans, aes(x=location.country_code, fill=sector)) + 
  geom_histogram(position="fill") + 
  scale_fill_manual(values = getPalette(num_sectors)) + 
  xlab("Country") + ylab("Frequency") + 
  ggtitle("Loan frequency by country and sector") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

#ggsave("Loans_by_country_and_sector.png")

Not exactly. Those are the three largest sectors, but their relative frequencies differ from country to country. In some countries, agriculture accounts for well over 50% of loans, or even 100% of loans; in others, agriculture loans are outnumbered by food and retail loans.

The number of sectors can make it hard to be sure of these frequencies, so let’s look at the same plot but only for these top three sectors:

top_sectors = c("Agriculture", "Food", "Retail")
top_sector_loans = hc_loans[hc_loans$sector %in% top_sectors, ]

ggplot(data=top_sector_loans, aes(x=location.country_code, fill=sector)) + 
  geom_histogram(position="fill") + 
  scale_fill_manual(values = getPalette(3)) + 
  xlab("Country") + ylab("Frequency") +
  ggtitle("Loan frequency by country and sector") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Now the distinctions are clear. This suggests that when we move on to multivariate analysis, we try conditioning on the borrower’s country before examining how sector relates to loan properties.

Time series

We have not yet looked at time series. Let’s do that now. We note that loans are not evenly spaced in time, so any time series analysis would land us in complex territory. However, we can get a sense of the overall activity by looking at how the loan amount relates to the posted_datetime variable:

ggplot(data=loans, aes(x=posted_datetime, y=terms.loan_amount)) + geom_line() +
  xlab("Time") + ylab("Loan Amount") + theme_bw()

The sporadic early activity suggests we not use dates early than 2007. Let’s narrow the window and use a smoothing function to get a better sense of the trend:

loans07 = loans[loans$posted_datetime > strptime("2007-01-01", format="%Y-%m-%d"), ]

ggplot(data=loans07, aes(x=posted_datetime, y=terms.loan_amount)) + 
  geom_smooth(lwd=2, level=0.99) +
  xlab("Time") + ylab("Loan Amount") + theme_bw()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Total loan amount is growing, as we would expect with a growing business. How do the number of borrowers change over time?

ggplot(data=loans07, aes(x=posted_datetime, y=b.num)) + 
  geom_smooth(lwd=2, level=0.99) +
  xlab("Time") + ylab("Number of borrowers") + theme_bw()

Generally, this value goes up. And what about the gender composition of the borrowers?

ggplot(data=loans07, aes(x=posted_datetime, y=b.female)) + 
  geom_smooth(lwd=2, level=0.99) +
  xlab("Time") + ylab("Fraction of females in borrower group") + theme_bw()

This value climbs for the first year – meaning that, on average, a greater fraction of people in the borrowing group are female – and then dips for the next three years.

How do loan amounts change over time for the top three sectors?

ggplot(data=loans07[loans07$sector %in% top_sectors, ],
       aes(x=posted_datetime, y=terms.loan_amount, colour=sector)) +
  geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Loan Amount") + 
    scale_colour_discrete(name="Sector") +
  theme_bw()

And finally, we ask: How do loan amounts change over time by country? To avoid cluttering the plot, we can look at only the top five countries, in terms of how many loans they received:

top_countries = sort(table(loans07$location.country_code), decreasing=T)[1:5]

ggplot(data=loans07[loans07$location.country_code %in% names(top_countries), ], 
       aes(x=posted_datetime, y=terms.loan_amount, colour=location.country_code)) + 
  geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Loan Amount") + 
  scale_colour_discrete(name="Country") + theme_bw()

Kenya had a spike in 2008 in which they took higher-value loans, while the per-loan value for other countries has been increasingly more stably – Nicaragua and Cambodia, in particular.

Relationships between loan and pipeline features

Let’s revisit our scatterplot matrix, only now we will include “pipeline features” – that is, the timediffs that represent the difference in days between the different stages in the loan pipeline. We can also add in the number of lenders.

pairs.panels(loans[, c(8, 9, 17, 18, 5, 14, 15, 16)])

Recall that we are separating the funding component of the pipeline from the repayment component. We observe that the loan amount and the funding time are correlated, albeit weakly; this is not surprising. Also not surprising is the correlation between loan amount and lender count, although the strength of it suggests that the individual contribution per loan is effectively fixed. What is somewhat more surprising is the correlation (again, weak) between funding time and gender. This correlation is negative, meaning that as the fraction of females in a borrowing group increases, the funding time tends to drop. There are similar trends, though even weaker, between the same variables and the repayment time.

We can confirm that there is a significant difference in the all-male and not-all-male populations when it comes to funding time using a t-test:

t.test(loans$funded_timediff ~ loans$b.fem>0)
## 
##  Welch Two Sample t-test
## 
## data:  loans$funded_timediff by loans$b.fem > 0
## t = 21.8486, df = 3817.287, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.715813 4.448433
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            7.136061            3.053939

There is definitely a difference in mean funding time: 4.08 ± 0.37 days. Given that the mean funding time for all-male borrowing groups is 7.14 days, this effect size is rather large. Recall that the median funding time is quite low: 0.71 days. (The mean is 3.99 days.) Since high-value loans tend to drive up the funding time, and since males tend to be associated with higher loan amount (albeit very weakly), we wonder whether this effect holds when we remove larger loans from consideration. Taking out the upper quartile:

normal_loans = loans[loans$terms.loan_amount <= quantile(loans$terms.loan_amount)[4], ]
t.test(normal_loans$funded_timediff ~ normal_loans$b.fem>0)
## 
##  Welch Two Sample t-test
## 
## data:  normal_loans$funded_timediff by normal_loans$b.fem > 0
## t = 16.486, df = 2596.384, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.837099 3.603109
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            5.554105            2.334001

It does. The difference in means is now closer to 3 days, but the effect is real.

Moving on: The number of borrowers and the length of the description have no apparent effect on how quickly a loan is funded (or repaid). Lender count, meanwhile, is correlated with variables that one might expect: the loan amount, the number of borrowers, and the funding time. It is not correlated with the gender composition of the borrowing group. We can quickly see, too, that the dollar per lender is completely uncorrelated with gender:

cor(loans$b.female, loans$terms.loan_amount/loans$lender_count)
## [1] -0.01249957

If gender does have an impact on lenders’ behavior, it is not in how much they give, but rather on whether they give – and the more people that do give, the more quickly the loan is funded.

Now we ask how sector and country relate to pipeline variables. We’ll begin with sector and funding time:

ggplot(data=loans, aes(x=sector, y=funded_timediff, 
                       colour=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Funding time") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

The median funding time for all sector is quite low, with slightly larger IQRs for sectors that one might expect: housing and transportation. The length of the each sector’s tail is the main distinguishing feature.

Does the number of lenders vary by sector?

ggplot(data=loans, aes(x=sector, y=lender_count, 
                       colour=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Number of lenders") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Not really. As we saw before, sector and loan amount aren’t very related, and since loan amount is reasonably correlated with the number of lenders, the lack of a relationship here isn’t surprising.

Does sector help explain any variance in repayment time?

ggplot(data=loans, aes(x=sector, y=paid_timediff, 
                       colour=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Repayment time") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, the answer is not really.

We move on to examining the relationship of the borrower’s country to the funding time.

hc_loans_0 = hc_loans[hc_loans$funded_timediff >= 0, ]
ggplot(data=hc_loans_0, 
       aes(x=location.country_code, y=funded_timediff, 
                       fill=get_continent(hc_loans_0))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Funding time") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

There are a few countries whose median value is much higher than the global median, which is a mere 0.71 days. These counties include:

long_fund = sort(tapply(hc_loans_0$funded_timediff, 
            hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(long_fund), origin="iso2c", destination="country.name")
##  [1] "Iraq"                            "Palestine, State of"            
##  [3] "Lebanon"                         "Armenia"                        
##  [5] "Colombia"                        "El Salvador"                    
##  [7] "Azerbaijan"                      "Bolivia, Plurinational State of"
##  [9] "Kyrgyzstan"                      "Tajikistan"

How does the number of lenders relate to the borrowing country?

ggplot(data=hc_loans, aes(x=location.country_code, y=lender_count, 
                       fill=get_continent(hc_loans))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Number of lenders") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, we see a fair amount of variance in the medians. Which countries require the most lenders?

high_lend_ct = sort(tapply(hc_loans_0$lender_count, 
            hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(high_lend_ct), origin="iso2c", destination="country.name")
##  [1] "Iraq"                "Armenia"             "Palestine, State of"
##  [4] "Kyrgyzstan"          "Dominican Republic"  "Mongolia"           
##  [7] "Lebanon"             "Ukraine"             "Costa Rica"         
## [10] "Azerbaijan"

Some of these names appeared in the previous list, but others are new. If we look at the median loan amount by country, we can get a sense for why (some of) these countries require more lenders and more funding time:

high_loan_amt = sort(tapply(hc_loans_0$terms.loan_amount, 
            hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(high_loan_amt), origin="iso2c", destination="country.name")
##  [1] "Iraq"                "Palestine, State of" "Armenia"            
##  [4] "Dominican Republic"  "Azerbaijan"          "Lebanon"            
##  [7] "Kyrgyzstan"          "Mongolia"            "Ukraine"            
## [10] "Costa Rica"

The answer is that these countries have some of the biggest loan amounts. (There are countries that have even larger loans, but fewer than 30 loans in number, such as the United States.)

Finally, we examine repayment time by country.

ggplot(data=hc_loans, aes(x=location.country_code, y=paid_timediff, 
                       fill=get_continent(hc_loans))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Repayment time") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Here, the disparities are quite significant. If we wish to predict how long it will take a borrower to pay back a loan, that borrower’s country could certainly be helpful.

How do these new pipeline variables relate to the calendar time? We can check.

ggplot(data=loans07,
       aes(x=posted_datetime, y=funded_timediff)) +
  geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Funding time") +
  theme_bw()

Funding time has been slowly increasing – but not by much. The maximum toward the end of record keeping could stand some investigation.

Do we see a corresponding increase in the repayment time?

ggplot(data=loans07,
       aes(x=posted_datetime, y=paid_timediff)) +
  geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Repayment time") + 
    scale_colour_discrete(name="Sector") +
  theme_bw()

From mid-2009 on, the answer is yes: repayment time is increasing. However, there is a different unexplained maximum toward the start of recording keeping.

Rates

We have seen that there is a weak correlation between the loan amount and certain pipeline features, particularly how long it takes a loan to get funded and repaid, and a strong correlation between the loan amount and the lender count. These are correlations we might have expected. It is natural at this point to create certain rate variables and see whether more nuanced patterns can be detected. We’ll consider a funding rate (in USD per day), a repayment rate (in USD per day), a lender rate (in USD per lender), and a borrower rate (in USD per borrower). These variables can help us to discover, for example, whether certain types of loans tend to be associated with a greater individual contribution or a slower repayment rate.

loans$f.rate = loans$terms.loan_amount / loans$funded_timediff
loans$l.rate = loans$terms.loan_amount / loans$lender_count
loans$r.rate = loans$terms.loan_amount / loans$paid_timediff
loans$b.rate = loans$terms.loan_amount / loans$b.num

We can get some summary statistics for each of our new features:

summary(loans$f.rate)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      6.28    173.40    772.00   2583.00   2745.00 181800.00

The median funding rate is 772 USD per day; given that the median loan amount is 550 USD, the median funding time of about 0.71 day is explainable. What’s going on with that extreme value?

loans[which.max(loans$f.rate), ]
##          id status      sector partner_id lender_count
## 12292 87088   paid Agriculture         48            1
##       description.languages location.country_code terms.loan_amount
## 12292                es, en                    BO               600
##       description.avgchar     posted_datetime     funded_datetime
## 12292                 794 2009-02-04 12:10:08 2009-02-04 12:14:55
##             paid_datetime  disbursal_datetime funded_timediff
## 12292 2010-02-15 11:56:36 2009-01-21 08:00:00          0.0033
##       disbursed_timediff paid_timediff b.num b.female   f.rate l.rate
## 12292            -14.177      390.1643     1        0 181818.2    600
##         r.rate b.rate
## 12292 1.537814    600

Here, we have a modest loan (600 USD) that got fully funded by one lender four minutes after it was posted. This observation reminds us that rates can be unstable.

The median lender rate, meanwhile, is about 30 USD per lender per loan, with an (enforced) minimum of 25 USD.

summary(loans$l.rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   27.11   30.30   39.05   36.25 1262.00
loans[which.max(loans$l.rate), ]
##          id status sector partner_id lender_count description.languages
## 8820 286467   paid Retail         30            2                ru, en
##      location.country_code terms.loan_amount description.avgchar
## 8820                    AZ              2525                 414
##          posted_datetime     funded_datetime       paid_datetime
## 8820 2011-03-29 15:50:03 2011-03-31 22:08:57 2012-08-15 09:59:50
##       disbursal_datetime funded_timediff disbursed_timediff paid_timediff
## 8820 2011-02-28 08:00:00          2.2631           -31.5479      534.0416
##      b.num b.female   f.rate l.rate   r.rate b.rate
## 8820     1        0 1115.726 1262.5 4.728096   2525

The maximum lender rate was for one loan of 2525 USD that was funded by two lenders.

What about the borrower rate?

summary(loans$b.rate)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     4.167   250.000   425.000   560.600   750.000 10000.000

This distribution ranges much more widely than the lender rate, and we can see from the size of the fourth quartile that it is very positively skewed. Some loans were large and taken out by a single borrower; others were spread out amongst many borrowers, to the tune of 4 USD per person.

Lastly, we look at the repayment rate.

summary(loans$r.rate)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.03846  1.15400  1.91700  2.97400  3.12400 52.86000

Some repayment times are negative, as we saw in our univariate analysis; this account for the negative values. Aside from that, we see another positively skewed distribution, with a modest median of 1.92 USD per day.

We can re-examine our correlation matrix with our rates:

pairs.panels(loans[, c(8, 17, 18, 14, 16, 19:22)])

On the lender side: The meager correlations we saw earlier between (numerical) loan features and pipeline features vanish when we look at rates. Neither the gender composition of the borrowing group nor the number of borrowers appears related to how quickly a loan is funded or how much each individual gives. The loan amount is also uncorrelated with these rates. (Note that if we remove loans that fall in the unstable upper quartile of funding rates, the correlation between gender and funding rate reappears, weakly, at r = 0.16.) One new relationship that we do see is the weak positive correlation between borrower rate and funding time, but when we compare borrower rate to funding rate, there is no correlation. The likely explanation here is that a higher borrower rate (more money per borrower) tends to mean a higher loan amount and thus a longer funding time; but since there is no relationship between loan amount and funding rate, the effect disappears when we consider rates.

These vanishing correlations suggest that if we wish to discover something interesting about how loan characteristics relate to lender behavior, we examine a subset of the loans – perhaps higher-value loans. As it stands, the value of most loans is small enough relative to the minimum lending amount that they admit a wide variance in lender behavior. Before trying that idea out, we’ll look at how country and sector relate to rates, however.

On the borrower side: We do notice some strong correlations between repayment rate and loan amount, as well as between repayment rate and number of borrowers. (Loan amount and number of borrowers are themselves strongly correlated, of course.) So larger loans tend to be repaid at a greater rate. There is also a weak negative correlation between borrower rate and gender composition: as the fraction of females in a borrowing group increases, the per-borrower amount tends to decrease. Finally, we note the moderate correlation between the borrower rate and the loan amount. This correlation seems at odds with some other correlations. As the loan amount increases, how can both the number of borrowers and the loan amount per borrower tend to increase? They could, of course, both increase, if each borrower were taking on more financial responsibility, but what seems to be happening here is that there are two sub-populations: multi-borrower groups and single-borrower groups. The former is driving the line of best fit in the plot of borrower number vs. loan amount, while the latter is driving the line of best fit in the plot of borrower rate vs. loan amount. We should consider splitting out these two populations in the future.

Let’s now examine how country and sector relate to these rates.

Rates by country

We’ll begin with country and funding rate.

hc_loans$f.rate = hc_loans$terms.loan_amount / hc_loans$funded_timediff
hc_loans$l.rate = hc_loans$terms.loan_amount / hc_loans$lender_count
hc_loans$r.rate = hc_loans$terms.loan_amount / hc_loans$paid_timediff
hc_loans$b.rate = hc_loans$terms.loan_amount / hc_loans$b.num

ggplot(data=hc_loans, aes(x=location.country_code, y=f.rate, 
                       fill=get_continent(hc_loans))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Funding rate") +  ylim(c(0,3000)) +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

The medians differ by as much as 1000 USD, but the ranges of the IQRs overlap more often than not, suggesting that the effect is due to chance.

What happens when we look at lender rate?

ggplot(data=hc_loans, aes(x=location.country_code, y=l.rate, 
                       fill=get_continent(hc_loans))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Lender rate") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, low-value loans and 25 USD contributions dominate, so the individual contribution rate doesn’t differ much on the basis of the borrower’s country.

We move on to borrower rate by country:

ggplot(data=hc_loans, aes(x=location.country_code, y=b.rate, 
                       fill=get_continent(hc_loans))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Borrower rate") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

We see more variance here. A lot of Asian countries, in particular, have high per-borrower loan amounts.

Next, we look at repayment rate by country:

ggplot(data=hc_loans, aes(x=location.country_code, y=r.rate, 
                       fill=get_continent(hc_loans))) + geom_boxplot() + 
  scale_fill_discrete(name="Continent") + 
  xlab("Country") + ylab("Repayment rate") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, there is a fair amount of variance.

For each of these rates, we can quantify how much variance is explained by the borrower’s country using the adjusted R squared value from an ANOVA:

summary.lm(aov(l.rate ~ location.country_code, hc_loans))$adj.r.squared
## [1] 0.00711905
summary.lm(aov(f.rate ~ location.country_code, hc_loans))$adj.r.squared
## [1] 0.009379178
summary.lm(aov(b.rate ~ location.country_code, hc_loans))$adj.r.squared
## [1] 0.4242269
summary.lm(aov(r.rate ~ location.country_code, hc_loans))$adj.r.squared
## [1] 0.1549973

So a borrower’s country does not seem to have much of a relationship to lender rates (adjusted R^2 = 0.007) or funding rates (0.01) – that is, how much each individual lenders give and how much funding per day a loan receives, respectively – but it does have a moderate relationship to borrower rates (0.42) and a weak relationship to repayment rates (0.16) – that is, the value of the loan per borrower and how much of a loan is repaid per day, respectively.

Rates by sector

Does sector have an impact of either of our rates? Again, we begin with the funding rate, excluding the upper quartile:

ggplot(data=loans, aes(x=sector, y=f.rate, 
                       color=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) +  
  xlab("Sector") + ylab("Funding rate") + ylim(c(0,3000)) +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Some sectors (e.g. entertainment, housing, personal use) definitely get funded more slowly than do others (e.g. arts, education, health).

Does sector have any impact on the lender rate? We know that 25 USD contributions will dominate, so let’s look only at loans that had lender rates above 50 USD per lender.

ggplot(data=loans, aes(x=sector, y=l.rate, 
                       color=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Lending rate") + ylim(c(50, 1000)) +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

The medians are all comparable, though we can see that the most popular sectors – agriculture, food, and retail – do inevitably attract some generous lenders. We also notice that no entertainment loans averaged 50 USD per lender o more.

Does sector explain any of the variance in borrower rate?

ggplot(data=loans, aes(x=sector, y=b.rate, 
                       color=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Borrower rate") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

It doesn’t seem to. What about repayment rate?

ggplot(data=loans, aes(x=sector, y=r.rate, 
                       color=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Repayment rate") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, sector doesn’t seem useful. We can get numerical estimates, as we did above:

summary.lm(aov(l.rate ~ sector, loans))$adj.r.squared
## [1] 0.0003695259
summary.lm(aov(f.rate ~ sector, loans))$adj.r.squared
## [1] 0.008509044
summary.lm(aov(b.rate ~ sector, loans))$adj.r.squared
## [1] 0.014931
summary.lm(aov(r.rate ~ sector, loans))$adj.r.squared
## [1] 0.01053561

All are effectively zero.

Rates over time

Does the calendar date help explain any of the variance in rates? We’ll investigate by encoding the time difference in days from Jan. 1, 2007.

start_datetime = strptime("2007-01-01", format="%Y-%m-%d")

start_datetime = as.POSIXct("2007-01-01T00:00:00Z", 
                                         format = "%Y-%m-%dT%H:%M:%SZ")

loans$disbursal_start_time = difftime(loans$disbursal_datetime, 
                                       start_datetime, units="days")

pairs.panels(loans[, c(8, 17, 18, 14, 16, 19:22, 23)])

The only real correlation here is with funded_timediff – which would indicate that as the calendar time wore on, it took slightly longer for the average loan to get funded. Nothing else stands out, so incorporating time into any future model of rates doesn’t seem worthwhile.

Summary

We have done a fairly thorough sweep of the pairwise relationship between our variables, including some new rate variables that we created. From here, we could go on to explore multiple variables at once, or we could consider transforming or segmenting variables and re-examining pairwise combinations. Given the number of directions we could move in, perhaps it would be wise to step back and ask what it is we’re even trying to learn or predict – beyond mere “data surprises.”

But first, here is what we have learned:

  • Loan amount is correlated with some expected variables: the number of borrowers, the funding time, the lender count, and the repayment time.
  • The fraction of the borrowing group’s composition that is female is negatively correlated with funding time, albeit weakly.
  • Sector doesn’t tell us much about the characteristics of the loan – not its amount, not its number of borrowers – nor does it explain much about how long a loan will take to get funded or repaid. Some sectors tend to have borrowing groups that are female-only, while others are more mixed-gender.
  • By contrast, the country of the borrower tells us more about a loan’s amount. There is also greater variance within countries when it comes to gender and number of borrowers. When we condition on country, we also see a greater variance by sector than we do unconditionally.
  • Country seems to account for some of the variance in the pipeline variables: number of lenders, funding time, and payback time. To some extent, this may be a conflation of country with loan amount, since these two are correlated.
  • Time series are complicated by the irregular observation intervals. That said, we see average loan amount and number of borrowers grow over time. Sector-wise, retail and agriculture loan amount grow in a fairly stable manner, food less so. Not all countries exhibit growth over the five-year period of the loan data set. Funding and repayment time both trend up, but both display curious maxima.
  • When we consider rates, certain correlations disappear on the lender side. The loan amount, the gender composition of the borrowing group, the borrower’s country, and the borrower’s sector – none of these impact the individual borrower’s average contribution or the amount of money donated by the Kiva lending community per day. The calender time is weakly correlated with the funding time.
  • On the borrower side, we do see some interesting relationships between rates and other variables. Repayment rate is strongly correlated with loan amount and with number of borrowers, and weakly negatively correlated with gender composition. Furthermore, a moderate amount of the variance in borrower rates (USD per borrower) is explained by the borrower’s country, while a smaller amount of the variance in repayment rate is explained by country.

We also encountered some curiosities in the data that we should address before moving on. Specifically:

  • We will remove countries that don’t appear at least 30 times in the data.
  • We will remove loans that have a funding time less than 0.
  • We will remove loans that happened before 2007.
head(hc_loans)
##       id status         sector partner_id lender_count
## 1 334113   paid           Food        183           42
## 2 313179   paid Transportation        145           24
## 3 206002   paid         Retail        145           15
## 4 354610   paid      Education        113           15
## 5 226099   paid    Agriculture        109           10
## 6  78960   paid           Food         70           85
##   description.languages location.country_code terms.loan_amount
## 1                    en                    SL              1175
## 2                    en                    PH               600
## 3                    en                    PH               425
## 4                es, en                    GT               400
## 5                    en                    KH               500
## 6                    en                    PE              2725
##   description.avgchar     posted_datetime     funded_datetime
## 1                 690 2011-09-11 00:40:06 2011-09-29 01:06:26
## 2                 524 2011-07-04 15:20:03 2011-07-08 06:18:03
## 3                1637 2010-07-01 07:00:16 2010-07-15 17:11:28
## 4                 472 2011-11-05 23:50:06 2011-11-06 02:52:45
## 5                 731 2010-10-01 07:00:09 2010-10-02 16:54:39
## 6                1890 2008-12-12 16:30:14 2008-12-16 15:52:58
##         paid_datetime  disbursal_datetime funded_timediff
## 1 2012-09-15 09:19:24 2011-08-15 07:00:00         18.0183
## 2 2012-01-15 11:01:44 2011-06-13 07:00:00          3.6236
## 3 2010-12-15 15:08:24 2010-05-07 07:00:00         14.4244
## 4 2012-05-15 08:52:45 2011-10-10 07:00:00          0.1685
## 5 2011-05-15 10:07:34 2010-07-28 07:00:00          1.4128
## 6 2009-06-15 22:51:05 2008-11-30 08:00:00          3.9741
##   disbursed_timediff paid_timediff b.num b.female     f.rate   l.rate
## 1           -44.7545      397.0968     1        0   65.21148 27.97619
## 2           -24.9709      216.2095     1        1  165.58119 25.00000
## 3           -69.4246      222.3808     1        1   29.46396 28.33333
## 4           -26.8700      218.0783     1        1 2373.88724 26.66667
## 5           -66.4130      291.1303     1        1  353.90713 50.00000
## 6           -16.3284      197.5771    13        1  685.68984 32.05882
##      r.rate    b.rate
## 1  2.958976 1175.0000
## 2  2.775086  600.0000
## 3  1.911136  425.0000
## 4  1.834204  400.0000
## 5  1.717444  500.0000
## 6 13.792084  209.6154
hc_loans2 = hc_loans[hc_loans$funded_timediff >= 0 & hc_loans$paid_timediff >= 0 , ]
hc_loans3 = hc_loans2[hc_loans$posted_datetime > strptime("2007-01-01", format="%Y-%m-%d"), ]

hc_loans3_json = jsonlite::toJSON(hc_loans3)
write(hc_loans3_json, file="hc_loans3.json")